Tested in R version 3.6.0.

Intro

Methods

Raw data were quality controlled with FastQC. The reads have high quality for all samples. Adapters were removed with Trimgalore and mapped with STAR[1] to the mm10 genome. STAR outputs counts for all genomic features/tags/genes (option: –quantMode GeneCounts). These are the raw data imported into this R pipeline which uses the edgeR[2] framework for RNAseq analysis (see ‘Differential Expression’ for more details). First, however, genes with low counts and unlikely to be translated (biologically meaningful) and also without enough counts for a reliable statistical judgement are removed. We require genes to have greater than (10/minimum sample library size in millions) counts per million.

The DGE analysis was performed using the edgeR framework. The TMM normalization for library size and composition bias was applied to the count-filtered data (not the VOOM normalized). This was followed by the models indicated in each contrast tab and tested using the quasi-likelihood t-test relative to a threshold (TREAT)[4] using a log2(1.5) threshold for differential expression between groups. P-values were then adjusted using the BH method. Genes were termed significant with FDR < 0.05. These significant genes were used in the Functional Enrichment using the ClusterProfiler R package. Both the Reactome Pathway and Gene Ontologies were tested for enrichment.

More detailed RNAseq explanation for users Most of the following text describing the RNAseq workflow was copied directly from the following publications to give a summary overview of the workflow:

EdgeR uses the negative binomial (NB) distribution to model the read counts for each gene in each sample. The dispersion parameter of the NB distribution accounts for variability between biological replicates. edgeR estimates an empirical Bayes moderated dispersion for each individual gene. It also estimates a common dispersion, which is a global dispersion estimate averaged over all genes, and a trended dispersion where the dispersion of a gene is predicted from its abundance. The estimation is robustified against potential outlier genes. For RNA-seq studies, the NB dispersions tend to be higher for genes with very low counts. The dispersion trend tends to decrease smoothly with abundance and to asymptotic to a constant value for genes with larger counts. From our past experience, the asymptotic value for the BCV tends to be in range from 0.05 to 0.2 for genetically identical mice or cell lines, whereas somewhat larger values (> 0.3) are observed for human subjects.

The NB model can be extended with quasi-likelihood (QL) methods to account for gene-specific variability from both biological and technical sources. Under the QL framework, the NB dispersion trend is used to describe the overall biological variability across all genes, and gene-specific variability above and below the overall level is picked up by the QL dispersion. The raw QL dispersion estimates are squeezed towards a global trend, and this moderation reduces the uncertainty of the estimates and improves testing power. The extent of the squeezing is governed by the value of the prior df estimated from the data. Large prior df estimates indicate that the QL dispersions are less variable between genes, meaning that strong EB moderation should be performed. Smaller prior df estimates indicate that the true unknown dispersions are highly variable, so weaker moderation towards the trend is appropriate. In general, if there are a large number of samples and/or high variability, then the QL dispersions are not squeezed very heavily from the raw values. If there are a low number of samples and/or low variability, then the dispersions will be squeezed more heavily.

The QL F-test is then used to find differentially expressed genes. While the likelihood ratio test is a more obvious choice for inferences with GLMs, the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It provides more robust and reliable error rate control even when the number of replicates is small.

Programs:

Citations:

  • [1] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635
  • [2] Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England), 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

Load Data

This data powers the contrasts, if there are problems please contact me Found 6 samples in file samples.tsv
Var1 Freq
LKB1-/- 3
LKB1+/+ 3
Metadata
sample group genotype condition
RJ_RNA_1 LKB1_KO LKB1-/- LKB1-/-
RJ_RNA_11 LKB1_KO LKB1-/- LKB1-/-
RJ_RNA_21 LKB1_KO LKB1-/- LKB1-/-
RJ_RNA_6 LKB1_WT LKB1+/+ LKB1+/+
RJ_RNA_16 LKB1_WT LKB1+/+ LKB1+/+
RJ_RNA_26 LKB1_WT LKB1+/+ LKB1+/+
## The following samples have no meta data:
## 
## The following samples have no count data:
## Resolving...

Feature Counts & Filtering

## Raw Feature Counts Summary (Millions):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.44   17.15   24.71   25.06   33.90   36.90
Raw Feature Counts
millions_of_counts sample
RJ_RNA_1 12.44183 RJ_RNA_1
RJ_RNA_11 18.75102 RJ_RNA_11
RJ_RNA_16 30.67760 RJ_RNA_16
RJ_RNA_21 34.97255 RJ_RNA_21
RJ_RNA_26 36.89899 RJ_RNA_26
RJ_RNA_6 16.61617 RJ_RNA_6

## Total tags:
##  50600
## Total tags with >0 counts:
##  24039

## Derived the cpm filter from a minimum of 10 counts in the smallest library
## Filters:
## minimum 0.8037402 cpm in  3 sample
## Filtered tags:
##  12302

Sample Clustering

VOOM Normalization For Plotting

We use a VOOM normalization to visualize the data. This transforms on a log2 scale.

Publication describing VOOM: Law, C. W., Chen, Y., Shi, W., & Smyth, G. K. (2014). Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2), 1–17. https://doi.org/10.1186/gb-2014-15-2-r29

PCA

Heatmap

Differential Expression

KO_vs_WT

FDR Filter: 0.05 .

logFC Filter 0.58 .

Diff. Gene Expr. Analysis Design:
groupLKB1_KO groupLKB1_WT
RJ_RNA_1 1 0
RJ_RNA_6 0 1
RJ_RNA_11 1 0
RJ_RNA_16 0 1
RJ_RNA_21 1 0
RJ_RNA_26 0 1
Contrast Matrix:
KO_vs_WT = groupLKB1_KO - groupLKB1_WT
groupLKB1_KO 1
groupLKB1_WT -1
Normalizing for library size and composition biases in the samples using trimmed mean of M-values (TMM) method
group lib.size norm.factors
RJ_RNA_1 1 12424902 1.0014461
RJ_RNA_6 1 16599141 0.9836468
RJ_RNA_11 1 18728147 1.0032096
RJ_RNA_16 1 30643044 0.9778989
RJ_RNA_21 1 34919959 1.0551082
RJ_RNA_26 1 36860477 0.9807325

In the MDS plot, the distance between each pair of samples can be interpreted as the leading log-fold change between the samples for the genes that best distinguish that pair of samples. By default, leading fold-change is defined as the root-mean-square of the largest 500 log2-fold changes between that pair of samples.

Common dispersion: 0.01953263 .

Biological Coefficient of Variation: 0.1397592 .

Typical values for the common BCV (square-root-dispersion) for datasets arising from well-controlled experiments are 0.4 for human data, 0.1 for data on genetically identical model organisms or 0.01 for technical replicates.

Total tags: 12302 .

Tags with FDR<0.05: 2689 .

Differentially expressed tags (FDR<0.05 & logFC>0.58): 2689 .

Top 25 Differentially Expressed Genes By Adjusted Pvalue
logFC unshrunk.logFC logCPM PValue FDR ens_gene ext_gene predicted_function chromosome_name start_position end_position transcript_length entrezgene_id
-9.962930 -9.980276 7.709535 0 1e-07 ENSMUSG00000023092 Fhl1 four and a half LIM domains 1 [Source:MGI Symbol;Acc:MGI:1298387] X 56731787 56793346 2356 14199
6.569521 6.572131 7.026823 0 2e-07 ENSMUSG00000047250 Ptgs1 prostaglandin-endoperoxide synthase 1 [Source:MGI Symbol;Acc:MGI:97797] 2 36230426 36252272 2867 19224
10.067015 10.131348 5.933270 0 2e-07 ENSMUSG00000031740 Mmp2 matrix metallopeptidase 2 [Source:MGI Symbol;Acc:MGI:97009] 8 92827291 92853420 3078 17390
5.299084 5.301919 5.632772 0 2e-07 ENSMUSG00000061353 Cxcl12 chemokine (C-X-C motif) ligand 12 [Source:MGI Symbol;Acc:MGI:103556] 6 117168535 117181367 1878 20315
-7.525936 -7.546090 5.055786 0 2e-07 ENSMUSG00000046572 Zfp518b zinc finger protein 518B [Source:MGI Symbol;Acc:MGI:2140750] 5 38668484 38684826 6804 100515
5.932290 5.935486 6.089325 0 2e-07 ENSMUSG00000033436 Armcx2 armadillo repeat containing, X-linked 2 [Source:MGI Symbol;Acc:MGI:1914666] X 134804145 134809221 3007 67416
-4.376064 -4.376877 6.526717 0 2e-07 ENSMUSG00000030220 Arhgdib Rho, GDP dissociation inhibitor (GDI) beta [Source:MGI Symbol;Acc:MGI:101940] 6 136923655 136941899 1153 11857
-11.776919 -11.894306 6.817376 0 2e-07 ENSMUSG00000001819 Hoxd13 homeobox D13 [Source:MGI Symbol;Acc:MGI:96205] 2 74668310 74671599 2483 15433
6.386351 6.387273 8.347684 0 2e-07 ENSMUSG00000026249 Serpine2 serine (or cysteine) peptidase inhibitor, clade E, member 2 [Source:MGI Symbol;Acc:MGI:101780] 1 79794197 79861180 2210 20720
-5.605983 -5.608306 6.241398 0 2e-07 ENSMUSG00000000184 Ccnd2 cyclin D2 [Source:MGI Symbol;Acc:MGI:88314] 6 127125162 127152193 6330 12444
5.570574 5.576739 4.779429 0 2e-07 ENSMUSG00000027339 Rassf2 Ras association (RalGDS/AF-6) domain family member 2 [Source:MGI Symbol;Acc:MGI:2442060] 2 131989415 132030258 8233 215653
-4.853919 -4.855760 5.822455 0 2e-07 ENSMUSG00000029648 Flt1 FMS-like tyrosine kinase 1 [Source:MGI Symbol;Acc:MGI:95558] 5 147561604 147726011 6895 14254
6.707474 6.720459 4.853428 0 2e-07 ENSMUSG00000044337 Ackr3 atypical chemokine receptor 3 [Source:MGI Symbol;Acc:MGI:109562] 1 90203980 90216751 3065 12778
9.480325 9.594091 4.537504 0 2e-07 ENSMUSG00000033361 Prrg3 proline rich Gla (G-carboxyglutamic acid) 3 (transmembrane) [Source:MGI Symbol;Acc:MGI:2685214] X 71962624 71972722 344 208748
4.293728 4.295360 5.422149 0 2e-07 ENSMUSG00000028517 Plpp3 phospholipid phosphatase 3 [Source:MGI Symbol;Acc:MGI:1915166] 4 105157347 105232764 3159 67916
8.871311 8.873712 9.440891 0 2e-07 ENSMUSG00000021614 Vcan versican [Source:MGI Symbol;Acc:MGI:102889] 13 89655312 89742509 8332 13003
6.264677 6.266776 7.043514 0 2e-07 ENSMUSG00000074743 Thbd thrombomodulin [Source:MGI Symbol;Acc:MGI:98736] 2 148404466 148408188 3723 21824
7.429860 7.439473 6.017043 0 3e-07 ENSMUSG00000018417 Myo1b myosin IB [Source:MGI Symbol;Acc:MGI:107752] 1 51749765 51916071 5052 17912
4.218141 4.218384 8.102869 0 3e-07 ENSMUSG00000031375 Bgn biglycan [Source:MGI Symbol;Acc:MGI:88158] X 73483602 73495933 828 12111
6.755702 6.758677 7.016033 0 3e-07 ENSMUSG00000036334 Igsf10 immunoglobulin superfamily, member 10 [Source:MGI Symbol;Acc:MGI:1923481] 3 59316735 59344394 8484 242050
5.551056 5.559305 4.346965 0 3e-07 ENSMUSG00000041444 Arhgap32 Rho GTPase activating protein 32 [Source:MGI Symbol;Acc:MGI:2450166] 9 32116136 32268446 13568 330914
10.469977 10.552187 5.993284 0 3e-07 ENSMUSG00000030671 Pde3b phosphodiesterase 3B, cGMP-inhibited [Source:MGI Symbol;Acc:MGI:1333863] 7 114415281 114539251 6573 18576
5.300661 5.301928 6.796958 0 4e-07 ENSMUSG00000022425 Enpp2 ectonucleotide pyrophosphatase/phosphodiesterase 2 [Source:MGI Symbol;Acc:MGI:1321390] 15 54838901 54952892 3095 18606
-5.148191 -5.152652 4.865875 0 4e-07 ENSMUSG00000024512 Dynap dynactin associated protein [Source:MGI Symbol;Acc:MGI:1922827] 18 70240429 70244582 1225 75577
-3.760112 -3.760807 6.131435 0 4e-07 ENSMUSG00000001123 Lgals9 lectin, galactose binding, soluble 9 [Source:MGI Symbol;Acc:MGI:109496] 11 78962974 78984946 1520 16859

Functional Enrichment

Hypergeometric Test

GSEA testing was performed with ClusterProfiler’s enrichGO function.

GO: Biological Process

GO: Molecular Function

GO: Cellular Compartment

GSEA

GSEA testing was performed with ClusterProfiler’s GSEA function.

Reactome Pathways

KEGG Pathways

K-means Clustering

[1] “Significantly differentially expressed genes are k-means clustered (k=6).” [1] “GO category enrichment is performed on each of these clusters.”